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Abstract 

o 

The problem of finding groups in data (cluster analysis) has been extensively stud- 
^ led by researchers from the fields of Statistics and Computer Science, among others, 

h— ^ However, despite its popularity it is widely recognized that the investigation of some 

theoretical aspects of clustering has been relatively sparse. One of the main reasons for 
this lack of theoretical results is surely the fact that, unlike the situation with other 
statistical problems as regression or classification, for some of the cluster methodolo- 
gies it is quite difficult to specify a population goal to which the data-based clustering 
algorithms should try to get close. This paper aims to provide some insight into the 
theoretical foundations of the usual nonparamctric approach to clustering, which un- 
derstands clusters as regions of high density, by presenting an explicit formulation for 
the ideal population clustering. 
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1 Introduction 

Clustering is one of the branches of Statistics with more research activity in the recent 



years. As noted by Meila (2007), "clustering is a young domain of research, where rigorous 



methodology is still striving to emerge" . Indeed, some authors have recently expressed their 



concerns about the lack of theoretical or formal developments for clustering ( 


von Luxburg 


and Ben-David 


2005 


Ackerman and Ben-David , |2008 


Zadeh and Ben-David 


2009 


). This 



paper aims to contribute to this regularization (or, say, rigorousization) . 

Stated in its most simple form, cluster analysis consists in "partitioning a data set into 
groups so that the points in one group are similar to each other and are as different as 



possible from the points in other groups" (Hand, Mannila and Smyth 2001 p. 293). Posed 



as such, the problem does not even seem to have a statistical meaning. 

In fact, some clustering techniques are solely based on the distances between the obser- 
vations. Close observations are joined together to form a group, and extending the notion of 
inter-point distance to distance between groups, the resulting groups are gradually merged 
until all the initial observations are contained into a single group. This represents, of 



course, the notion of agglomerative hierarchical clustering (Izenman, 2008, Section 12.3). 
The graphical outcome depicting the successive agglomeration of data points up to a single 
group is the well-known dendrogram, and depending on the notion of inter-group distance 
used along the merging process the most common procedures of this type are known as sin- 



gle linkage, complete linkage or average linkage (see also Hastie, Tibshirani and Friedman 



2()09t p. 523). 

A first statistical flavour is noticed when dealing with those clustering methodologies 
that represent each cluster by a central point, such as the mean, the median or, more 
generally, a trimmed mean. This class of techniques is usually referred to as partitioning 



methods, and surely the most popular of its representatives is means (MacQueen, 1967) 



For a pre-specified number K of groups, these algorithms seek for K centers with the goal 



of optimizing a certain score function representing the quality of the clustering ( Everitt et 



aq [20TTl Chapter 5). 

A fully statistical notion of the problem is distribution-based clustering. This ap- 
proach further assumes that the data set is a sample from a probability distribution. As 
with all the statistical procedures, there are parametric and nonparametric methodologies 
for distribution-based clustering. Surely the golden standard of parametric clustering is 



achieved through mixture modeling as clearly described in Fraley and Raftery (2002). It 
is assumed that the distribution generating the data is a mixture of simple parametric dis- 
tributions, e.g. multivariate normal distributions, and each component of the mixture is 
associated to a different population cluster. Maximum likelihood is used to fit a mixture 
model and then each data point is assigned to the most likely component using the Bayes 
rule. 
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Eventually, by using mixture distribution components defined through Mahalanobis 
distances, some partitioning methods and mixture-model clustering can be regarded to 
have the same motivation, and under this point of view some partitioning techniques have 
evolved to reach a high level of maturity, leading to very sophisticated methods. See, for 
instance, the recent review of Garcia-Escudero et al. (2010) and references therein. 

In parametric distribution-based clustering the notion of population cluster is clearly 
defined through the components of the mixture model. If the true underlying distribution 
belongs indeed to the postulated mixture model, this identification is certainly quite ad- 
equate. However, often mixture models are used in a "nonparametric way", on the basis 
that they usually represent a dense subclass of the set of all continuous distributions and, 
therefore, they are useful to approximate arbitrarily well any continuous distribution. In 
that case, the identification of components with clusters may not coincide with the natural 
intuition. For instance, several normal components may be needed to model accurately a 
distribution with a heavier-than-normal tail, yet the resulting normal mixture might as well 
be a unimodal density, suggesting the existence of a single group, though with a non-normal 
distribution shape. The fact that mixture components and clusters are not necessarily the 
same has motivated several variations of mixture-model clustering to amend this deficiency; 



see, for example, Biernacki, Celeux and Govaert (2000), Baudry et al. (2010) or Hennig 

poiol ). 

The previous example is helpful to understand the motivation of the nonparametric 
methodology, which is based on identifying clusters as regions of high density separated 
from each other by regions of lower density. In this sense, population clusters are naturally 
associated with the modes (i.e., local maxima) of the probability density function, and this 



nonparametric approach is denominated mode-based clustering or modal clustering (Li, 



Ray and Lindsay, 2007). Precisely, each cluster is usually understood as the "domain of 



attraction" of a mode (Stuetzle, 2003). 

The concept of domain of attraction is not that simple to specify, and providing a precise 
definition is one of the goals of this paper that will be explored in the next section. It is 
perhaps this imprecise notion of population cluster in the nonparametric case what does 
not fully convince many researchers and statisticians and lead them to adopt one of the 
alternative previously expounded techniques (parametric clustering, partitioning methods 
or hierarchical algorithms). 

The first attempt to make the goal of modal clustering precise is through the notion of 
level sets (Hartigan, 1975). If the distribution of the data has a density /, given A > the 
A-level set of / is defined as L{X) = {x : f{x) > A}. Then, population A-clusters are defined 
as the connected components of L(A), a definition that clearly captures the notion of groups 
having a high density. An extensive account of the usefulness of level sets in applications is 
given in 



Mason and Polonik (2009). 



The main drawback of this approach is the fact that the notion of population cluster 
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Figure 1: Univariate trimodal density for which it is not possible to capture its whole cluster 
structure using a level set analysis based on a single level. 



depends on the level A, as recognized by Stuetzle (2003). However, other authors like 



Cuevas, Febrero and Fraiman (2001) or Cadre, Pelletier and Pudlo (2011), affirm that the 



choice of A is only a matter of resolution level of the analysis. Indeed, the level sets can 



be re-parameterized in terms of their probability content, as in Cadre (2006) or Azzalini 



and Torelli] (2007), for ease of interpretability. This is usually done by finding, for any 
p G (0, 1), the threshold Ap defined as the largest value of A such that the probability of 
L(A) is greater than p, and setting R{p) = L{Xp) so that the level set R{p) has probability p 
as long as the border of R{p) has null probability. Then, the chosen value of p implies that 
a proportion a = 1—p oi the data is left out of the clustering mechanism, and therefore the 
level A (or p, or a) can be interpreted as a trimming parameter motivated by robustness 
considerations. In fact, the parametrization of the level sets in terms of the probability a of 



Hyndman 


(1996 


); 


Baillo, 



Cuesta-Albertos and Cuevas (2001) or Samworth and Wand (2010). 



Nevertheless, it is easy to think of many examples in which it is impossible to observe 



the whole cluster structure on the basis of a single level A. Essentially as in Rinaldo et 
al. (2012, p. 906), Figure [T] shows a simple univariate example of this phenomenon: three 
different groups are visually identifiable, yet none of the level sets of the density has three 
connected components. To amend this, the usual recommendation is to analyze the cluster 
structure for several values of the level A (or the coverage probability p). Graphical tools 
oriented to this goal are the cluster tree ( [Stuetzle , 2003 ) , or the mode function ( Azzalini and 



Torelli , 2007 ) . Both graphics are useful to show how the clusters emerge as a function of A 
or p, respectively. A detailed description of the cluster identification methodology through 
the cluster tree is given in Section [2] below. 

Level sets also arise as the maximizers of excess mass functionals. For a given value of 
A > 0, the excess mass functional is defined as E\{A) = P{A)—Xn{A) for any measurable set 
A, where denotes the Lebesgue measure. Hartigan (1987]) and Miiller and Sawitzki (1987 



1991 ) independently introduced this functional and noticed that it reaches its maximum 



value precisely at -L(A). A nice feature of the excess mass functional is that it admits 
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a fairly simple empirical estimator obtained by replacing P for the empirical probability 
measure of the sample. However, the maximizer of this empirical criterion does not provide 
a useful estimate of L{X) because it turns out to be just the set of all the data points. 
Nevertheless, by restricting the empirical functional to a suitable class of measurable sets 
its maximizer can be shown to be a sensible estimate of L{X) within such a class. This 



estimation procedure is studied in detail in Polonik ( 1995 ) making use of empirical processes 
theory. 

On the other hand, the estimation of density level sets using smoothing techniques has 



its roots in the papers of Cuevas and Fraiman (1997); Tsybakov (1997) or Walther (1997), 



and related data-driven cluster methodologies based on the notion of A-clusters are proposed 



in 


Cuevas, Febrero and Fraiman 


(2001); 


Stuetzle (2003 


); 


Azzalini and Torelli 


(2007) 


Nugent 


and Stuetzle 


(2010 


); 


Stuetzle and Nugent (2010 


); 


Rinaldo and Wasserman 


(2010 


) or 


Rinaldo 


et al. 


(2012 


), among 


others. In all these papers a plug-in approach is adopted, 


in the sense 



that a population level set is estimated by the corresponding level set of some density 
estimator. It is to be noted that for most of the plug-in estimators, and also for the excess 
mass estimators, the computation of empirical level sets is indeed a difficult issue, and 
many of the aforementioned references also present intricate algorithms to approximate 
the corresponding level set estimators. Relatedly, in Machine Learning literature surely 



the most popular algorithms for density-based clustering include DBSCAN (Easter et al 



1996), based on a single level, and OPTICS (Ankerst et al, 1999), which is a generalization 



of the former which allows to compute the cluster tree induced by a range of levels. 



2 A precise approach to the goal of modal clustering 

Certainly, the very concept of modal cluster may admit some well-supported criticism, as 



illustrated in the thorough reflections of Stuetzle (2003, Section 6) or Hennig (2010), and 



we agree with Izenman (2008, p. 408) that "there is no universally accepted definition 
of exactly what constitutes a cluster". Some of the most common definitions have been 
surveyed in the previous section. However, it is not the goal of this paper to contribute 
to the discussion of the different notions of cluster, but to introduce a precise definition 
of population modal cluster which does not rely on the concept of level set. This aims 



to provide an answer to Question 1 in von Luxburg and Ben-David (2005): "How does 



a desirable clustering look like if we have complete knowledge about our data generating 



process?" 

Since the empirical formulation of the clustering task consists of partitioning a data set 
into homogeneous groups, a population clustering S of a probability distribution P on M*^ 
should be understood as a partition of M'^ into mutually disjoint measurable components 



Ci, . . . , Cfc, each with positive probability content (Ben-David, von Luxburg and Pal, 2006). 



The components of such a partition are called population clusters. Thus, two population 
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Figure 2: Identification of clusters for the trimodal density example using the cluster tree. 



clusterings C and 2) are identified to be the same if they have the same number of clusters 
and, up to a permutation of the cluster labels, every cluster of C and its most similar match 
in T> differ in a null-probability set (more details on this are elaborated in Section [s]). 

As anticipated in the previous section, a basic principle about the concept of ideal pop- 
ulation clustering is that it should reflect the notion of a partition into regions of high 
density separated from each other by regions of lower density, leading to the idea of popu- 
lation cluster as the domain of attraction of a density mode. We begin our way to formalize 
this concept by showing some illustrative examples in one and two dimensions. 

In the one-dimensional case, it seems clear from Figure [2] how this can be achieved. To 
begin with, the level set methodology identifies the three clusters in the density depicted in 
Figure [1] by computing the cluster tree as described clearly in Nugent and Stuetzle (2010): 
starting from the 0-level set, which corresponds to the whole real line in this example and 
consists of a single connected component, A is increased until it reaches Ai, where two 
components for the Ai-level set are found, G'l and G2, resulting in the cluster tree splitting 
into two different branches (see Fig. [2| panel (a)). These two components G'l and G2 are 
usually called cluster cores. They do not constitute a partition of M, but the remaining 



parts F[ and Fj, referred to as fluff in Nugent and Stuetzle (2010), can be assigned to either 



Figure 3: Bidimensional example, with two groups clearly identifiable at an intuitive level. 

the left or the right branch depending on whichever of them is closer. Thus, at level Ai the 
partition M = C( U C2 is obtained. The point dividing the line into these two components 
can be arbitrarily assigned to either of them; this assignment makes no difference, because 
it leads to equivalent clusterings. 

At level A2 the left branch is further divided into two branches (see panel (b) of Fig. [2]). 
Again, the two cluster core components G'l and G2 do not form a partition of the set C[ 
associated with the previous node of the tree, but it is clear how the fluff F[' and F2 can 
be assigned to form a partition U of C[. Since no further splitting of the cluster tree 
is observed as A increases, the final partition of M is U C2 U renamed to Ci U C2 U C3 
in panel (c) of Figure [2] 

It is immediate to observe that the levels at which a connected component breaks into 
two different ones correspond precisely to local minima of the density function, so an equiv- 
alent formulation consists of defining population clusters as the connected components of M 
minus the points where a local minimum is attained (the solid circles in panel (c) of Figure 
[2]) . Notice that this definition does not involve the computation of level sets for a range of 
levels, nor their cores and fluff, so that it constitutes a more straightforward approach to 
the same concept in the unidimensional setup. 

To get an idea of how to generalize the previous approach to higher dimensions, consider 
the following extremely simple bidimensional example: an equal-proportion mixture of two 
normal distributions, each with identity variance matrix and centered at fXi = (— 1,0) and 
fj,2 = —f^i, respectively. At an intuitive level, it is clear from Figure[3]that the most natural 
border to separate the two visible groups is the black line. The problem is: what is exactly 
that line? Is it identifiable in terms of the features of the density function in a precise, 
unequivocal way? A nice way to answer to these questions is by means of Morse theory. 



Figure 4: The three possible configurations around a critical point of a Morse function in 
the bidimensional case. 



Morse theory is a branch of Differential Geometry that provides tools for analyzing the 
topology of a manifold M C M"^ by studying the critical points of a differentiable function 
/: M — 7- M. A classical reference book on this subject is Milnor (1963) and enjoyable 
introductions to the topic can be found in Matsumoto (2002) and Jost (2011, Chapter 7). 
A useful application of Morse theory is for terrain analysis, as nicely developed in Vitalli| 
(2010). In terrain analysis, a mountain range can be regarded as the graph of a function 
/: M — )• M, representing the terrain elevation, over a terrain M C M^, just as in the left 
graphic of Figure [3} The goal of terrain analysis is to provide a partition of M through 
watersheds indicating the different regions, or catchment basins, where water flows under 
the effect of gravity. 

The fundamentals of Morse theory can be extremely summarized as follows. A smooth 
function / : M — )• M is called a Morse function if all its critical points are non-degenerate. 
Here the critical points of / are understood as those xo G M for which the gradient Df{xo) 
is null, and non-degeneracy means that the determinant of the Hessian matrix H/(a;o) is 
not zero. For such points the Morse index m{xo) is defined as the number of negative 
eigenvalues of H/(a;o). 

Morse functions can be expressed in a fairly simple form in a neighborhood of a critical 
point xq, as the result known as Morse lemma shows that it is possible to find local coor- 
dinates xi, . . . ,Xn such that / can be written as itxi ± ■ ■ ■ ± Xd + c around a^o, where the 
number of minus signs in the previous expression is precisely m{xo). For example, for d = 2 
the three possible configurations for a critical point are shown in Figure |4j corresponding 
to a local minimum, a saddle point and a local maximum (from left to right), with Morse 
indexes 0, 1 and 2, respectively. 

The decomposition of M suggested by Morse theory is made in terms of the unstable 
and/or stable manifolds of the critical points of / as explained next. Consider the initial 
value problem defined by the minus gradient vector of /. For a given value x G M at 
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time t = 0, the integral curve i^a; : R — )■ M of such initial value problem is the one satisfying 

u'^it) = -Df{u^{t)), v^{0) = x 

and the set of all these integral curves is usually referred to as the negative gradient flow. 
Since the minus gradient vector defines the direction of steepest descent of /, these curves 
(or properly speaking, their images through /) represent the trajectories of the water flow 
subject to gravity. 

With respect to the negative gradient flow, the unstable manifold of a critical point 
is defined as the set of points whose integral curve starts at xq , that is 

VV^ixo) = \x £ M: lim Uxit) = xo]. 

t— >— oo 

Analogously, the stable manifold of Xq is the set of points whose integral curve finishes at xq, 



that is, Wl{xo) = {x e M : limt_>+oo i^a;(i) = a^o}- It was firstly noted by Thom (1949) 
that the set of unstable manifolds corresponding to all the critical points of / provides 
a partition of M (the same is true for the stable manifolds). Furthermore, the unstable 
manifold W^Ixq) has dimension ■m{xQ). 

The main contribution of this section is to define the population clusters of a density / 
as the unstable manifolds of the negative gradient flow corresponding to local maxima of /. 
Or in a more prosaic way, in terms of water flows, a cluster is just the region of the terrain 
that would be flooded by a fountain emanating from a peak of the mountain range. 

Although this is an admittedly cumbersome definition, going back to Figure |3] it is clear 
that it just describes the notion that we were looking for. The critical point xq = (0, 0) 
is a saddle point, thus having Morse index 1, and the black line is precisely its associated 
unstable manifold, W!^{xq) = {0} x M, which is a manifold of dimension 1. The remaining 
two critical points are local maxima, and their respective unstable manifolds are W^{xi) = 
(—00,0) X M and W^{x2) = (0,oo) x M, manifolds of dimension 2 so that we can partition 
M? = W!!:{xq)UW^{xi)UW^{x2), showing W^(a;i) and Ty"(£C2) as two population clusters 
separated by the border W"(a;o), which is a null-probability set. 

Notice that this definition also applies to the previous univariate example in Figure [2j 
the clusters Ci, C2 and C3 are just the unstable manifolds of the three local maxima (they 
are manifolds of dimension 1), and for the two local minima their unstable manifolds have 
dimension 0, so they include only the respective points of minima. 

Moreover, if we focus on the gradient flow, instead of the negative gradient flow, then 
its integral curves satisfy 

7Ui) = D/(7.W), 7.(0) = a;, 

the unstable manifold for the negative gradient flow becomes the stable manifold for the 
gradient flow and viceversa. Therefore, we could equivalently define the cluster associated 
to a mode xq of the density as its stable manifold with respect to the gradient flow, that 
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is, W^{xo) = {a; G M: limt_j>oo Ixi^) — ^o}- This is a precise formulation of the notion of 
domain of attraction of the mode xq, since W^{xq) represents the set of aU the points that 
chmb to Xq when they follow the steepest ascent path defined by the gradient direction. 

3 Comparing population clusterings 

Once the population goal of the clustering problem has been clearly identified as the parti- 
tion of M*^ given in terms of the stable manifolds of the density modes with respect to the 
gradient flow, it is natural to wonder how this population clustering can be estimated from 
the data, how the error between a data-based clustering and the population clustering can 
be measured and to set up a notion of consistency. Those are the goals for this section. 

At this point it is worth distinguishing between two different, although closely related, 
concepts. When the density / is unknown, and a sample Xi, . . . , X„ drawn from / is given, 
any procedure to obtain a data-based partition of M.'^ will be called a data-based clustering. 
Notice, however, that when data are available the goal of most clustering procedures is just 
to partition the data set, and not the whole space M'^. This will be referred to henceforth 
as a clustering of the data. Naturally, the ideal population clustering or any data-based 
clustering immediately result in a clustering of the data, by assigning the same group to 
data points belonging to the same component of the given partition of W^. 

A good deal of measures to evaluate the distance between two clusterings of the data have 



been proposed in the literature. The work of Meila (2007) provides both a comprehensive 



survey of the most used existing measures as well as a deep technical study of their main 



properties, and for instance Arable and Boorman (1973) or Day (1981) provide further 



alternatives. But apart from some exceptions, as the Hamming distance considered in Ben- 



David, von Luxburg and Pal (2006), it should be stressed that all these proposals concern 



partitions of a finite set. However, since the ideal population goal of clustering introduced 
here is defined as a partition of W^, it is necessary to introduce distances between two 
partitions of M*^ to evaluate the differences between two clusterings. 



3.1 A distance in measure between clusterings 

Let C and T) be two clusterings of a probability distribution P, and assume for the moment 
that both have the same number of clusters, say S = {Ci, . . . ,Cr} and T> = {Di, . . . , Dr}. 
The first step to introduce a distance between C and D is to consider a distance between 
sets. Surely the two distances between sets most used in practice are the Hausdorff distance 



and the distance in measure; see Cuevas and Fraiman (2010). The Hausdorff distance is 



specially useful when dealing with compact sets (it defines a metric in the space of all 
compact sets of a metric space), it tries to capture the notion of physical proximity between 



two sets (Rodn'guez-Casal 2003). In contrast, the distance in measure between two sets C 
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and D refers to the content of their symmetric difference CAD = (C H D"^) U {C^ D D), and 
it defines a metric on the set of all measurable subsets of a measure space, once two sets 
differing in a null-measure set are identified to be the same. 

Although we will return to the Hausdorff distance later, our first approach to the notion 
of distance between C and D relies primarily on the concept of distance in measure, and the 
measure involved is precisely the probability measure P. From a practical point of view, it 
does not seem so important that the clusters of a data-based partition get physically close 
to those of the ideal clustering. Instead, it is desirable that the points that are incorrectly 
assigned do not represent a very significant portion of the distribution. Thus, it seems 
natural to affirm that two clusters C G C and D € (resulting from different clusterings) 
are close when P{CAD) is low. 

Therefore, for two clusterings C and D with the same number of clusters, a sensible 
notion of distance is obtained by adding up the contributions of the pairwise distances 
between their components, once they have been relabelled so that every cluster in C is 
compared with its most similar counterpart in D. In mathematical terms, the distance 
between C and D can be measured by 

r 

di(e,D) = min VP(QAZ^,(i)), (1) 

aeVr — • 
1=1 

where Vr denotes the set of permutations of {1, 2, . . . , r}. 

It can be shown that di defines a metric in the space of all the partitions with the 
same number of components, once two such partitions are identified to be the same if they 
differ only in a relabelling of their components. Moreover, the minimization problem in ([T]) 
is usually known as the linear sum assignment problem in the literature of Combinatorial 
Optimization, and it represents a particular case of the well-known Monge-Kantorovich 
transportation problem. A comprehensive treatment of assignment problems can be found 
in 



Burkard, Deh'Amico and Martello (2009) 



If a partition is understood as a vector in the product space of measurable sets, with 
the components as its coordinates, then di resembles the Li product distance, only adapted 
to take into account the possibility of relabelling the components. This seems a logical 
choice given the additive nature of measures, as it adds up the contribution of each distance 
between the partition components as described before. However, it would be equally possible 
to consider any other Lp distance, leading to define 



r 

dp{Q,D) = min { P(a AI)<,(,)f} 



1/p 



for p>l and also (ioo(C, 23) = miuo-e-p^ max{P(Cj ADg-^j)) : z = 1, . . . , r}. The minimization 
problem defining doo is also well known, under the name of the linear bottleneck assignement 
problem, and its objective function is usually employed if the interest is to minimize the 
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Partition C 



Partition T) 






A3 




A2 



Figure 5: Two partitions of the unit square that do not differ much if A3 has low probabihty. 



latest completion time in parallel computing (see Burkard, Dell'Amico and Martello, 2009 
Section 6.2). In the context of clustering, surely the di distance seems the most natural 
choice among all the dp possibilities, due to its clear interpretation. 

Nevertheless, the definition of the di distance involves some kind of redundancy, due to 
the fact that C and D are partitions of W^, because the two disjoint sets that form every 
symmetric difference in fact appear twice in each of the sums in ([T]). More precisely, taking 
into account that P{CAD) = P{C) + P{D) - 2P{C n it follows that for every a eVr 

r r 

PidAD^^i)) = 2 - 2 ^ P(Q n (2) 

i=l i=l 

To avoid this redundancy, the eventual suggestion to measure the distance between C and 
f , based on the set distance in P-measure, is dp(C, D) = ^di{Q,T>). 

If the partitions C and D do not have the same number of clusters, then as many empty 
set components as needed are added so that both partitions include the same number of 



components, as in Charon et al. (2006), and the distance between the extended partitions 
is computed as before. Explicitly, if C = {Ci, . . . , Cr} and D = {Di, . . . , Dg} with r < s 
then, writing Cj = for i = r + 1, . . . , s, we set 

^ s ^ r s 

(ip(e,D) = „ min^P(C,AD,(,)) = „ min{^P(C,AD,(,))+ P(I?.(.))|. 

i=l i=l i=r+l 

Thus, the term X^^^^+i P{Da(i)) can be interpreted as a penalization for unmatched prob- 
ability mass. 

The idea is that two partitions as those shown in Figure [5] do not differ much if ^3 has 
low probability, even if they do not have the same number of clusters. For the partitions in 
Figure [5| denote 6 = {Ci, C2} and T> = {Di, D2, D3} with Ci = Di = Ai, C2 = ^2 U ^3, 
P>2 = A2, P>3 = A3, and assume that P{Ai) = 0.5, P(^2) = 0.45 and P{A3) = 0.05. 
Then, it can be shown that dp{e,V) = 0.05. In every cluster of C is matched to some 
cluster in D, depending on the permutation for which the minimum is achieved. When S 
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has less clusters than T>, some of the components of T) will be matched with the empty set, 
indicating that they do not have an obvious match in C, or that they are unimportant. In 
the previous example, the minimum is achieved when Ci is matched with Di, C2 with D2 
and D3 is matched with the empty set. 

It is interesting to note that dp{Q, T>) can be estimated in a natural way by replacing P 
with the empirical measure based on the data Xi , . . . , X„ , leading to 



dp{e,v) 



1 

— min 

2n aGVs 



T n s n 



1=1 j=l 



i=r+l j=l 



where Ia denotes the indicator function of the set A. When r = s, it follows from ([2]) that 
an alternative expression for dp(C, D) is 



dp{e,v) 

and, therefore, its sample analogue 

Tp{e,v) = 



r 

max > 

1 = 1 



p(CinD,(,)) 



^ lib 



1=1 j=i 

coincides with the so-called classification distance between two clusterings of the data, whose 



Meila 



(2005 



2007 



2012). For r < s, however, dp differs from the 



properties are explored in 
classification distance (which does not include the penalty term) , but it corresponds exactly 



with the transfer distance, studied in detail in Charon et al. (2006) (see also Denoeud 2008). 



Extending the properties of the transfer distance to its population counterpart suggests an 
interpretation of dp (6,2?) as the minimal probability mass that needs to be moved to 
transform the partition C into T>, hence the connection with the optimal transportation 
problem. 



3.2 A distance between sets of sets 

An alternative notion of distance between two clusterings based on the Hausdorff metric 
has been kindly suggested by Professor Antonio Cuevas, noting that precisely this distance 



was used in Pollard (1981) to measure the discrepancy between the set of sample X-means 



and the set of population iC-means. If {X, p) is a metric space and A^BCX are two 
non-empty subsets of X, the Hausdorff distance between A and B is defined as 



dH{A^ B) = max < sup inf p{a, b), sup inf p(a, b) > 



or, equivalent ly, as 



dH{A,B) = inf{e > 0: yl C 5^ and S C ^^|, 
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where = UaeA{^ ^ '■ Pi^^o) < e}, and is defined analogously. 

In the context of clustering, take X to be the metric space consisting of all the sets of 
equipped with the distance p{C,D) = P{CAD), once two sets with P-null symmetric 

difference have been identified to be the same. Then any two clusterings C = {Ci, . . . , Cr} 

and D = {Di, . . . , Ds} can be viewed as (finite) subsets of X, and therefore the Hausdorff 

distance between C and D is defined as 

(i//(e,D) = max] max min P{CiADj), max min P{CiADj)\ 

K i=l,...,r j=l,...,s j=l,...,s i=l,...,r J 

= inf{e > 0: e C and D C C^j. 

To express it in words, dniQ, 2?) < £ whenever for every d G Q there is some Dj G D such 
that P{CiADj) < £ and viceversa. 

The Hausdorff distance can be regarded as a uniform distance between sets. It is not 
hard to show, using standard techniques from the Theory of Normed Spaces, that when 
r = s we have 

dH(e,D) < 2(ip(e,D) < rdH{e,D). 

However, when r < s the distance dn can be more demanding than dp, meaning that both 
partitions have to be really close so that their Hausdorff distance results in a small value. 
For instance, it can be checked that for the two clusterings of the previous example, shown 
in Figure [sj the Hausdorff distance between them is (i//(C,D) = 0.45, mainly due to the 
fact that C2 and D3 are far from each other, since P{C2AD'i) = P{A2) = 0.45. A clear 
picture of the difference between dn and dp is obtained by arranging all the component- 
wise distances P{CiADj) into an r x s matrix. Then, the Hausdorff distance is obtained 
by computing all the row-wise and column-wise minima and taking the maximum of all of 
them. In contrast, for the distance in P-measure the first step when r < s is to add s — r 
row copies of the vector (P(L'i), . . . , P{Ds)) to the matrix of component-wise distances, and 
then compute the distance in P-measure as half the minimum possible sum obtained by 
adding up a different element in each row. As a further difference, note that the Hausdorff 
distance does not involve a matching problem; instead, this distance is solely determined 
by the two components that are furthest from each other. 

Obviously, a sample analogue is also obtained in this case by replacing P for the empirical 
probability measure, leading to 

(1^(6,2?) = -max] max min /c,AD, (Xfc), max min /c,AD, (Xfc) L 

n L «=l,...,r i=l,...,s — ' i=l,...,s ?=l,...,r — ' J 

fc=l fc=l 

which seems not to have been considered previously as a distance between two clusterings 
of the data. 
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Figure 6: Two density functions that are not close but induce exactly the same clustering. 
3.3 Consistency of data-based clusterings 

As indicated above, a data-based clustering is understood as any procedure that induces 
a partition C„ of based on the information obtained from a sample Xi, . . . , X„ from 
a probability distribution P. Once we have a precise definition for the ideal population 
clustering goal C, a data-based clustering is said to be consistent if it gets closer to S 
as the sample size increases. Formally, if d(C„, C) — )• as n — )• c« for some of the modes 
of stochastic convergence (in probability, almost surely, etc), where d represents one of the 
distances between clusterings defined above, or any other sensible alternative. 

A plug-in strategy to obtain data-based clusterings naturally arises when the density / 
is replaced with a smooth estimator /, and so the data-based clusters are defined as the 
domains of attraction of the modes of /. Obvious candidates for the role of / include kernel 
density estimators for the nonparametric setup or mixture model density estimators in a 
parametric context. 

This is a very simple approach, that involves to some extent estimating the density func- 



tion to solve the clustering problem (unsupervised learning) , and according to von Luxburg 



(2004, p. 21) it may not be a good idea because density estimation is a very difficult prob- 
lem, especially in high dimensions. A similar situation is found in the study of classification 
(supervised learning), where the optimal classifier, the Bayes rule, depends on the regres- 
sion function of the random labels over the covariates. Here, even if classification can be 
proved to be a problem easier than regression, nevertheless regression-based algorithms for 
classification play an important role in the development of supervised learning theory (see 



Devroye, Gyorfi and Lugosi 1996| Chapter 6) 



Along the same lines. Figure [6] illustrates why we should not completely discard density 
estimation as an intermediate step for clustering. Figure [6] shows a typical situation where 
the solid line is the true density and the dashed line is a kernel density estimator, since an 
expansion of its pointwise bias shows that on average the kernel estimator underestimates 



the maxima and overestimates the minima (Wand and Jones, 1995, p. 21). But even if the 



two density functions are not really close in any global sense, they produce exactly the same 
clusterings of M, so this also seems to suggest that clustering should be easier than density 
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estimation. Studying consistency and error rates for data-based clusterings obtained from 
density estimators surely represents an interesting open problem. 



4 Practical algorithms 



Apart from the theoretical considerations in the previous sections, for practical purposes it is 
necessary to provide algorithms that help to compute population or data-based clusterings. 
As noted in Section [T| even if the density / is fully known this may become an onerous 
task for some methodologies, as for instance those based on level sets, for which smart 
implementations are needed to compute the connected components of the level sets or the 



cluster tree (Cuevas, Febrero and Fraiman, 2001 Stuetzle and Nugent, 2010) 



With the concept of cluster as the domain of attraction of the density modes, the 
natural approach to compute the partition induced by a density is to use the so-called 
mode seeking algorithms. Surely the most popular algorithm in this class is the mean 



shift algorithm, initially introduced by Fukunaga and Hostetler (1975) and subsequently 



recast by Cheng (1995) and Comaniciu and Meer (2002) to highlight several applications in 



Engineering. The mean shift algorithm is a iterative procedure which, at every step, moves 
the point obtained in the previous iteration to a location of higher density, thus producing 
a convergent sequence that transports any initial value to a local maximum of the density. 
Explicitly, any initial point is transformed recursively to obtain a sequence defined by 



Vj+i = Vj + ^DfiVj)/ fiVj), 



(3) 



where A is a positive-definite matrix chosen to ensure convergence. Notice that if A is a 
positive multiple of the identity matrix then the sequence approximately follows the steepest 
ascent path, since at every step the shift is made along the gradient direction. In this case, 
the mean shift algorithm is easily recognized as a variant of the classical gradient ascent 
algorithm used for numerical maximization, but employing the normalized gradient (the 
gradient divided by /) to accelerate convergence in low-density zones. 

Since this algorithm can be applied to any initial point, it effectively produces a partition 
of W^. However, when a sample Xi, . . . , X„ is available it is common to use every sample 
point as initial value to obtain a clustering of the data. On the other hand, if the sample 
is employed to construct a smooth estimator / of the density, then Equation ^ with / 
replaced by / allows to produce a data-based clustering (hence, also a clustering of the data 
as noted above). 

Moreover, for certain types of density estimators there exist closed forms for the matrix 
A that ensure the convergence of the algorithm to a local maximum. In general, suppose 
the density / corresponds to a normal mixture model, so that 



L 

E 

e=i 



(4) 



17 



where (/)(a;| /x, 5]) = |27r5]|~"'^/^ exp{—^{x — fx)^ 12^^ {x — fx)} refers to the density of the nor- 
mal distribution with mean vector fi and covariance matrix S), and tti, . . . ,ttl are nonneg- 
ative quantities that sum to one. Then, Carreira-Perpifian (2007) showed that the iterative 
algorithm defined by ([s]) is convergent for every initial value x/q by taking A = Aj = S*(yj), 
where = {"^£=1 Oii{y)YlJ^}~^ is a weighted harmonic mean of the covariance ma- 

trices of the components, with weights a£{y) = 'ir£(j){y\fj,£,'Si)/ f{y) given by the posterior 
odds of the ith component of the mixture after observing y. 

Hence, in the normal mixture model where the density estimator / is obtained by 
replacing the parameters in Q with estimates tti, fi^, Sl^, the mean shift algorithm with A = 
^*{yj) the previous definition, but also with the parameters replaced by estimates) 
provides a computationally simple way to obtain the corresponding data-based clustering. 
Some examples of the partitions thus obtained, for normal mixture densities taken from 
Wand and Jones (1993) and Ray and Lindsay (2005), are shown in Figure [T] 



Furthermore, suppose that a nonparametric kernel density estimator /h is employed in 
3j), namely /h(^) = SILi -^h(2; — Xj), where the kernel K is a fixed density function, 
H is a positive-definite bandwidth matrix and Kii{x) = |H|~^/^Er(H~^/^a;). If the kernel 
K is taken as the normal density, then fni^) = "121=1 n^^^Pi^l^i: H), and so its structure 
as a normal mixture density allows to conclude as before that the mean shift algorithm is 
convergent for any initial point by just taking A = H. More generally, proceeding as in 
the proof of Theorem 1 in Comaniciu and Meer ( 2002 ) it can be showed that convergence 
is guaranteed with A = H for a general kernel of the form K{x) = k{x~^x), as long as the 
profile : [0, oo) — )• M is convex and monotonically decreasing. 

Many variants of the mean shift algorithm have been explored in the literature, expand- 
ing in different directions. In the nonparametric context, the iterative procedure ([s]) with 
A = H as above can be explicitly written as yj+i = Y2i=i Wi{yj)X.i, hence taking the form 
of a weighted mean of the observations, with the weight functions defined as 



GH(y-x, 



T:=lGu{y-^^y 

where the kernel G{x) oc g{x^x) is based on the positive profile g = —k'. Sheik, Khan and 
Kanade (2007) provided an alternative formulation of ([s]) by noting that 



Uj+i = argmm 



y\\2 



^ GH(Xj 



(5) 



where ||a;||2 = {x^ x)'^/'^ denotes the usual Euclidean distance, and proposed the medioid- 
shift algorithm by restricting the minimization in ([s]) to the set of sample points, thus 
reducing the computation time since only one iteration of the algorithm has to be computed 
per data point because y 



Xfc for some k G { 1 , . . . , n} . Accelerating the mean shift 
algorithm is in fact one of the expansion directions that has generated much research. A 



18 




1 1 




1 1 



Figure 7: Examples showing the ideal population clustering for several bivariate normal 
mixture densities, calculated using the mean shift algorithm. 



(clearly non-exhaustive) list of references on this topic would include Carreira-Perpihan 
p006l ), [WaEg et a/.|p007| ) and |Vedaldi and Soatto| ( pOOSl ). 

Other variants of the mean shift algorithm are oriented towards robustness, by con- 



sidering median shifts instead of mean shifts. Wang, Qiu and Zamar (2007) proposed the 



algorithm CLUES that at each iteration shifts the previous point to the coordinate-wise 



(i.e., marginal) median of its nearest neighbors and Shapira, Avidan and Shamir (2009) 
explored other possibilities using the different concepts of multivariate median related to 
depth measures. A further refinement of clues, with improved computational efficiency, 
and also based on local medians is the attractors algorithm of |Peha, Viladomat and Za- 
(2012). In fact, it would be possible to modify ([S]) to obtain shift algorithms based on 



mar 



other norm-based local medians, which would be the local analogues to the global proposals 
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studied in Dodge and Rousson (1999) (see also Small, 1990). 



Indeed, a geometric median shift algorithm over Riemannian manifolds was introduced 



m 



Wang and Huang (2010), and its applications to clustering were explored in Wang, 



Huang and Wu (2013). In this respect, the adaptation of shift algorithms to manifolds 



does represent a third major avenue of active research on the topic. For instance, among 



some other recent papers dealing with this subject we could cite Subbarao and Meer (2009), 



Ozertem and Erdogmus (2011) and Genovese et al. (2012). 



5 Conclusions and directions for further research 



At the time of comparing different clustering procedures it is necessary to have a "ground 
truth", or population goal, that represents the ideal clustering to which the clustering 
algorithms should try to get close. Sometimes this ideal population clustering is not so 
easy to specify and it depends on the notion of cluster in which the researcher is interested. 
Here, the population goal for modal clustering is accurately identified, making use of some 
tools from Morse theory, as the partition of the space induced by the domains of attraction 
of the local maxima of the density function. 

This definition needs the probability density to be smooth to a certain degree, specifically 
it must be a Morse function. It would be appealing to extend this notion to density functions 
that are not Morse functions, meaning either that they are smooth but have degenerate 
critical points or even that they are not differentiable. To treat the first case, it might be 
useful to resort to the theory of singularities of differential mappings, which is exhaustively 



covered in the book by Arnold et al. ( 1998 ), for instance. On the other hand, the study of the 



non-smooth case might start from Agrachev, Pallaschke and Scholtes (1997), where Morse 



theory for piecewise smooth functions is presented. Here, the key role would be played by 
the subgradient, which generalizes the concept of gradient for non-smooth functions. 

Another interesting challenge consists of studying the choice of the parameters for the 
density estimators (the bandwidth for kernel estimators, the mixture parameters for mix- 
ture model estimators) that minimize the distance between the corresponding data-based 
clustering and the true population clustering, as measured by any of the distance between 
clusterings discussed in Section [3} Or, perhaps even better, to develop methods aimed 
to perform modal clustering that do not necessarily rely on a pilot density estimate, as it 
happens for classification methods, which are not necessarily based on a regression estimate. 

On the algorithmic side, the alternative formulation ^ of the mean shift algorithm 



by Sheik, Khan and Kanade (2007) surely deserves further attention. Its resemblance to 



locally weighted regression methodology (Ruppert and Wand, 1994), identifying somehow 



mean shift as a locally constant adjustment, suggests that perhaps a local linear mean 
shift could improve the performance of clustering based on kernel density estimation, in the 
same way as local linear regression improves over the Nadaraya- Watson (locally constant) 
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estimator. 
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